Add matmul with transpose #35

Merged
austinvhuang merged 1 commit into AnswerDotAI:main from junjihashimoto:feature/transposed-matmul
Jul 31, 2024
Conversation

@junjihashimoto
Collaborator

This PR implements matrix multiplication with a transposed weight matrix.
On an NVIDIA GPU it is about 1.5 times faster.

# In case of NVIDIA GeForce RTX 3080 Laptop GPU
$ yes |  MATMUL_VERSION=8  make  | grep 'Kernel version\|GFLOPS'
[info] Dispatching Kernel version 8: 2D blocktiling with loop unrolling and vectorization, 30 iterations ...
113.3 milliseconds / dispatch ~ 2426.99 GFLOPS
$ yes |  MATMUL_VERSION=9  make  | grep 'Kernel version\|GFLOPS'
[info] Dispatching Kernel version 9: 2D blocktiling with loop unrolling, vectorization and transpose, 30 iterations ...
74.7 milliseconds / dispatch ~ 3680.25 GFLOPS

# In case of M2 pro
$ yes |  MATMUL_VERSION=8  make  | grep 'Kernel version\|GFLOPS'
[info] Dispatching Kernel version 8: 2D blocktiling with loop unrolling and vectorization, 30 iterations ...
164.3 milliseconds / dispatch ~ 1672.82 GFLOPS
$ yes |  MATMUL_VERSION=9  make  | grep 'Kernel version\|GFLOPS'
[info] Dispatching Kernel version 9: 2D blocktiling with loop unrolling, vectorization and transpose, 30 iterations ...
160.7 milliseconds / dispatch ~ 1710.18 GFLOPS
